Author gender identification from text using Bayesian Random Forest
Abstract:
The heavy use of virtual environments and the connections people form through social networks such as Facebook, Instagram, and Twitter make it more important than ever to identify shared topics in these settings. Several applications benefit from reliable methods for inferring the age and gender of social media users. Such applications span a wide range of fields, from personalized advertising to law enforcement to reputation management. Text posts make up a large portion of user-generated content and contain information that can help uncover undisclosed user attributes or check the honesty of self-reported age and gender. Because most information is exchanged as text, profiling authors by attributes such as age, gender, and political or religious opinions from this content is becoming increasingly important. Gender identification, which can be useful in security and marketing, answers the following question: given a short text document, can we identify whether the author is male or female? This question is motivated by recent events in which people faked their gender on the Internet. In this paper, author gender identification in blog data is investigated. Four groups of features are employed: syntactic features, word-based features, character-based features, and function words. In addition, character n-gram features are used to improve classification accuracy. To evaluate the proposed method, 3212 texts were collected from Technorati.com and blogger.com. Experimental results demonstrate that these types of features are practical. Furthermore, a new classification method called "Bayesian Random Forest" is introduced. Each tree in a Bayesian Random Forest is a Bayes tree. The experimental results show that this method compares favorably with other classification algorithms such as Naïve Bayes, Naïve Bayes Tree, and Random Forest, and that it increases the accuracy of gender identification to 89.5%.
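As a rough illustration of two of the feature groups named in the abstract (character n-grams and function words), the following Python sketch counts overlapping character n-grams and computes function-word rates for a single text. It is only a sketch under stated assumptions: the function names, the choice of n = 3, the tokenization, and the short `FUNCTION_WORDS` list are all hypothetical and are not taken from the paper, which does not specify these details here.

```python
from collections import Counter

# Hypothetical, abbreviated function-word list (an assumption for illustration;
# the list actually used in the paper is not given in the abstract).
FUNCTION_WORDS = ("the", "and", "of", "to", "in", "a", "is", "that", "i", "it")


def char_ngrams(text, n=3):
    """Count overlapping character n-grams in the lowercased text."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def function_word_rates(text):
    """Relative frequency of each function word among whitespace tokens."""
    # Strip common punctuation from token edges; a real tokenizer may differ.
    tokens = [t.strip(".,;:!?\"'()") for t in text.lower().split()]
    tokens = [t for t in tokens if t]
    total = len(tokens) or 1  # avoid division by zero on empty input
    counts = Counter(tokens)
    return {w: counts[w] / total for w in FUNCTION_WORDS}


def extract_features(text, n=3):
    """Merge both feature groups into one dict keyed by feature name."""
    features = {f"char_{n}gram:{g}": c for g, c in char_ngrams(text, n).items()}
    features.update({f"fw:{w}": r for w, r in function_word_rates(text).items()})
    return features
```

Calling `extract_features(post_text)` on each blog post would give one sparse feature dictionary per document, which could then be vectorized and passed to any classifier. The classifier itself, including the Bayesian Random Forest described in the abstract, is not sketched here.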
Similar resources
Author Identification using Random Forest and Sequential Minimal Optimization
Author identification is a significant factor in the global economic loss due to computer-related crimes. According to the Center for Strategic and International Studies (CSIS), an estimated 375 to 575 billion dollars is lost each year due to computer or cybercrimes. Recently, various techniques have been used to improve the accuracy of author identification. In this paper, we propose combining...
Author Identification: Using Text Mining, Feature Engineering & Network Embedding
Authorship analysis is a challenging area that has developed over centuries, with research widely scattered across multiple disciplines, mainly computational linguistics, text mining, data mining, stylometry and machine learning. Conventional techniques from the past relied heavily on stylometry and text-based content analysis of document text for authorship analysis. More recen...
Bayesian Multinomial Logistic Regression for Author Identification
Motivated by high-dimensional applications in authorship attribution, we describe a Bayesian multinomial logistic regression model together with an associated learning algorithm.
Using Fuzzy LR Numbers in Bayesian Text Classifier for Classifying Persian Text Documents
Text classification is an important research field in information retrieval and text mining. The main task in text classification is to assign text documents to predefined categories based on the documents’ contents and labeled training samples. Since word detection is a difficult and time-consuming task in the Persian language, a Bayesian text classifier is an appropriate approach to deal with different...
Author Identification from Citations
Machine Learning techniques can be applied to citation data from a network of papers to predict the author of a paper that is currently outside of the network. Using a series of models we have found that we can increase the accuracy from past experiments with citation data, by considering the citations as a network. This allows us to predict with confidence the author of a blind paper.
Volume 16, Issue 1
Pages 143-157
Publication date: 2019-06